Live freelance tracking. Raw descriptions turned into structured data. Find your next tech project without the noise.
upwork.com 2026-05-18
Process and analyze a large database of Indian names
Client: USA, member since 2024-11-07
Price: $5.00-$60.00 hourly
Problem: Need to clean, preprocess, and analyze a massive dataset of Indian names for demographic insights.
Existing: Not specified
Specifications:
[Target] Clean and pre-process raw .csv files with OCR and Unicode artifacts
[Method] Use Python (pandas/polars, regex, Unicode text handling)
[UI/UX] Not applicable
[Stack] Python, pandas, polars, regex, IndicXlit/AI4Bharat
[Security] Ensure data privacy and security during processing
[Format] Output structured JSON for further analysis
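As a rough illustration of the [Target] and [Method] items, the sketch below shows one way the Unicode side of the cleaning might look using only the Python standard library. The function name clean_name and the particular set of invisible characters are assumptions made for this example, not requirements from the client. ZWJ and ZWNJ are deliberately left alone here because they can be meaningful in Indic scripts.

```python
import re
import unicodedata

# Invisible characters that often leak in from OCR or export steps.
# Illustrative list only: zero-width space, word joiner, BOM.
_INVISIBLE = re.compile("[\u200b\u2060\ufeff]")
_WHITESPACE = re.compile(r"\s+")


def clean_name(raw):
    """Return a name with normalized Unicode and tidy whitespace."""
    # NFC composes base letters with their combining marks where possible.
    text = unicodedata.normalize("NFC", raw)
    text = _INVISIBLE.sub("", text)
    # Replace control characters (category Cc) with spaces.
    text = "".join(
        " " if unicodedata.category(ch) == "Cc" else ch for ch in text
    )
    # Collapse runs of whitespace and trim the ends.
    return _WHITESPACE.sub(" ", text).strip()
```

Applied column-wise in pandas or polars, a function like this would run before transliteration, so that later steps see one consistent form of each string.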
Workflow:
1. Import raw .csv files into a DataFrame using pandas or polars.
2. Handle OCR and Unicode artifacts by cleaning text data.
3. Transliterate names from Devanagari and Gurmukhi to Latin script using IndicXlit/AI4Bharat.
4. Normalize and standardize names according to pre-specified rules.
5. Extract personal names and surnames from full and parental name fields using rule-based parsing.
6. Apply existing ML algorithms to infer religion from names.
7. Construct frequency-based measures across age cohorts and geographies.
8. Output results in structured JSON format.
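Steps 5, 7, and 8 could be sketched as follows. This is a minimal illustration, not the client's rule set: the last-token surname rule, the field names name and cohort, and the sample records are all made up for the example. Real Indian naming conventions (initials, patronymics, single-name individuals) would need richer rules. It uses only the standard library so the logic stays visible; the same grouping could be done in pandas or polars.

```python
import json
from collections import Counter, defaultdict


def split_name(full_name):
    """Naive rule (assumption): the last token is the surname."""
    tokens = full_name.split()
    if len(tokens) < 2:
        # Single-token names are treated as having no surname.
        return (full_name.strip(), "")
    return (" ".join(tokens[:-1]), tokens[-1])


def surname_shares_by_cohort(records):
    """Relative frequency of each surname within each age cohort."""
    counts = defaultdict(Counter)
    for rec in records:
        _, surname = split_name(rec["name"])
        if surname:
            counts[rec["cohort"]][surname] += 1
    return {
        cohort: {s: round(n / sum(c.values()), 4) for s, n in c.most_common()}
        for cohort, c in counts.items()
    }


# Made-up sample records, for illustration only.
records = [
    {"name": "Amit Kumar Sharma", "cohort": "1980-1989"},
    {"name": "Priya Sharma", "cohort": "1980-1989"},
    {"name": "Gurpreet Singh", "cohort": "1990-1999"},
    {"name": "Lakshmi", "cohort": "1990-1999"},
]
result = surname_shares_by_cohort(records)
# ensure_ascii=False keeps any non-Latin names readable in the JSON output.
print(json.dumps(result, ensure_ascii=False, indent=2))
```

Extending the grouping key from cohort alone to a (cohort, geography) pair would cover the geographic breakdown mentioned in step 7.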